Figures:
Exchange Server 2013 SP1 CU7: 8 multi-role servers (6 CPUs, 24 GB memory each), 92 mailbox databases in one DAG with 3 copies each, evenly distributed across all 8 servers. 1 additional Mailbox server hosts 3 recovery databases.
SCOM 2012 R2 UR4 with the latest Exchange MP (December 2014). (Currently being implemented.)
Facts:
All Exchange servers but one continuously flip between Monitored and Not Monitored in SCOM, at intervals of between 5 minutes and 1 hour. (I've increased the Object Discovery frequency on all Exchange servers for troubleshooting.)
All servers except that same one return Unknown for Get-HealthReport and Get-ServerHealth while the SCOM status is Not Monitored, and return the expected output while SCOM reports Monitored.
On all servers but one, MSExchangeHMWorker runs high on CPU (~50%), and SCOM reports "Service terminated unexpectedly" for this service about 45 times in 48 hours (on 2 of the servers, about 200 times in 48 hours).
I've added a D: drive, and the path was created there with log files. Unfortunately, the problem remains.
In the Managed Availability event logs I see typical behaviour and events ("Healthset X determined to be healthy" and so on), as if MA is working like a charm, but querying MA via Get-ServerHealth doesn't work.
I know for sure that SCOM and MA were working fine: I saw all 9 servers Monitored (with unhealthy health sets from time to time) in mid-to-late 2014, when we ran a pilot for SCOM. If I remember correctly, CU4 and CU6 were installed in the meantime, as far as I know without issues.
We have 2 management groups (the pilot group and the production group); both behave the same way.
In the event logs I can find no events, or at least no obvious ones, indicating that MA is not working, so we have no idea where to start troubleshooting.
All servers (VMware) were installed and configured the same way.
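
To put the MSExchangeHMWorker termination counts above in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes the terminations are spread evenly over the 48 hours, which the SCOM alerts do not confirm, and the function name is purely illustrative.

```python
def mean_interval_minutes(terminations: int, window_hours: float) -> float:
    """Average minutes between terminations, ASSUMING an even spread."""
    return window_hours * 60 / terminations


# Counts taken from the SCOM alerts described above.
print(mean_interval_minutes(45, 48))   # most servers: 64.0
print(mean_interval_minutes(200, 48))  # the 2 worst servers: 14.4
```

Under that assumption the worker dies roughly every 14 to 64 minutes, which falls in the same range as the 5-minute to 1-hour flipping between Monitored and Not Monitored. That may be a coincidence, but it could suggest the two symptoms are related.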